We present EasyRec, an easy-to-use, extensible, and efficient recommendation framework for building industrial recommender systems. Our EasyRec framework is superior in the following aspects: first, EasyRec adopts a modular and pluggable design pattern to reduce the effort of building customized models; second, EasyRec implements hyperparameter optimization and feature-selection algorithms to improve model performance automatically; third, EasyRec applies online learning to adapt quickly to evolving data distributions. The code is released at: https://github.com/alibaba/easyrec.
With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to downstream tasks via prompt tuning. Prompt tuning seeks to probe the beneficial information for downstream tasks from the general knowledge stored in the image and text encoders of the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as text prompts on the language side; however, tuning the text prompts alone does not affect the visual features computed by the image encoder, leading to sub-optimal performance. In this paper, we present a dual-modality prompt tuning paradigm that learns text prompts and visual prompts for the text and image encoders simultaneously. Furthermore, to make the visual prompts concentrate more on the target visual concept, we propose Class-Aware Visual Prompt Tuning (CAVPT), in which class-aware visual prompts are generated dynamically by performing cross attention between the language descriptions of template prompts and the visual class-token embeddings. Our method provides a new paradigm for tuning large pre-trained vision-language models, and extensive experimental results on 8 datasets demonstrate the effectiveness of the proposed method. Our code is available in the supplementary materials.
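The cross-attention step that generates the class-aware visual prompts can be illustrated with a minimal single-head scaled dot-product attention. This is a generic sketch only: the learned projection matrices, multi-head splitting, and the exact shapes used by CAVPT are omitted, and the variable names are illustrative assumptions.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Single-head scaled dot-product cross-attention (illustrative sketch).

    Q: (n_q, d) queries, e.g. template-prompt text embeddings.
    K, V: (n_k, d) keys/values, e.g. visual class-token embeddings.
    Learned projections and multi-head splitting are omitted for brevity.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (n_q, d) attended values
```

A quick sanity check: when all keys are identical, the softmax weights are uniform and every output row equals the mean of the value rows.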
The three existing dominant network families, i.e., CNNs, Transformers, and MLPs, differ from each other mainly in the ways of fusing spatial contextual information, leaving designing more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate flexible contextual information distributed across different channels from other tokens into the given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at channel level. In this way, the spatial range of token-mixing can be expanded to a global scope with limited computational complexity, where the way of token-mixing is reformed. We take ATM as the primary operator and assemble ATMs into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses different families of SOTA vision backbones by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. Code is available at https://github.com/microsoft/ActiveMLP.
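The per-channel "where to capture" idea behind ATM can be sketched with a tiny 1-D gather: each channel of a query token recruits its context from a (possibly different) token at a predicted offset. This is purely illustrative and not the paper's exact operator; ATM additionally learns the offsets and the channel-level fusion, both of which are omitted here.

```python
import numpy as np

def active_token_mix(x, offsets):
    """1-D sketch of per-channel token mixing (illustrative only).

    x: (N, C) sequence of N tokens with C channels.
    offsets: (N, C) integer offsets -- for query token i and channel c,
    the value is gathered from token (i + offsets[i, c]) mod N, so each
    channel can recruit context from a different token.
    """
    N, C = x.shape
    idx = (np.arange(N)[:, None] + offsets) % N   # (N, C) source token per channel
    return x[idx, np.arange(C)[None, :]]           # gather channel c from token idx[i, c]
```

With all offsets zero the operator is the identity; non-zero offsets shift each channel's context window independently, which is the property that lets the mixing range grow to a global scope without dense all-pairs interaction.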
A recent work [4] analyzes the local convergence of Adam in the neighborhood of an optimal solution for twice-differentiable functions. It finds that the learning rate must be sufficiently small to ensure the local stability of the optimal solution; the same convergence result also applies to AdamW. In this work, we propose a new adaptive optimization method, which we call Aida, that extends AdamW in two aspects so as to relax the small-learning-rate requirement for local stability. First, we track the p-th moment $r_t$ of the gradient magnitudes; when $p = 2$, $r_t$ reduces to $v_t$ of AdamW. Let $\{m_t\}$ be the first moment of AdamW. It is known that the update direction $m_{t+1}/(v_{t+1} + \epsilon)^{0.5}$ of AdamW (or $m_{t+1}/(v_{t+1}^{0.5} + \epsilon)$ of Adam) can be decomposed into a sign vector $\mathrm{sign}(m_{t+1})$ multiplied element-wise by a magnitude vector $|m_{t+1}|/(v_{t+1} + \epsilon)^{0.5}$ (or $|m_{t+1}|/(v_{t+1}^{0.5} + \epsilon)$). Aida instead computes the magnitude as $|m_{t+1}|^q/(r_{t+1} + \epsilon)^{q/p}$ (or $|m_{t+1}|^q/(r_{t+1}^{q/p} + \epsilon)$), which reduces to that of AdamW when $(p, q) = (2, 1)$. Suppose the origin $0$ is a local optimal solution of a twice-differentiable function. It is found theoretically that, for $q > 1$ and $p > 1$ in Aida, the origin $0$ is locally stable only when the weight decay is non-zero. Experiments are conducted for solving ten toy optimization problems and for training Transformer and Swin-Transformer models on two deep-learning (DL) tasks. The empirical study shows that in a number of scenarios (including the two DL tasks), Aida with particular settings $(p, q) \neq (2, 1)$ outperforms AdamW with $(p, q) = (2, 1)$.
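A single Aida-style update step can be sketched directly from the abstract's formulas. The sketch below assumes exponential moving averages for $m_t$ and $r_t$ and decoupled weight decay as in AdamW; bias correction and the concrete hyperparameter values are simplifying assumptions, not details from the abstract.

```python
import numpy as np

def aida_step(theta, g, m, r, lr=1e-3, beta1=0.9, beta2=0.999,
              p=2.0, q=1.0, eps=1e-8, weight_decay=1e-2):
    """One hypothetical Aida update step (bias correction omitted).

    m tracks the first moment of the gradient; r tracks the p-th moment
    of the gradient magnitude. The per-coordinate step magnitude is
    |m|^q / (r + eps)^(q/p), applied along sign(m). With (p, q) = (2, 1)
    this reduces to the AdamW-style update m / (v + eps)^0.5, with
    decoupled weight decay as in AdamW.
    """
    m = beta1 * m + (1 - beta1) * g
    r = beta2 * r + (1 - beta2) * np.abs(g) ** p
    step = np.sign(m) * np.abs(m) ** q / (r + eps) ** (q / p)
    theta = theta - lr * (step + weight_decay * theta)
    return theta, m, r
```

Setting (p, q) = (2, 1) recovers the AdamW magnitude exactly, which makes the reduction claim easy to verify numerically.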
Confounders in deep learning are generally detrimental to a model's generalization ability, as they infiltrate feature representations. Learning causal features free of interference from confounders is therefore important. Most previous causal-learning methods employ the back-door criterion to mitigate the adverse effect of certain specific confounders, which requires explicit identification of those confounders. However, in real-world scenarios, confounders are typically diverse and difficult to identify. In this paper, we propose a novel Confounder Identification-free Causal visual Feature learning (CICF) method, which obviates the need to identify confounders. CICF models the interventions among different samples based on the front-door criterion, and then approximates the global-scope intervening effect from the instance-level interventions from the perspective of optimization. In this way, we aim to find a reliable optimization direction, free of the intervening effects of confounders, for learning causal features. Furthermore, we uncover the relation between CICF and the popular meta-learning strategy MAML, and provide, for the first time, a theoretical interpretation from the perspective of causal learning of why MAML works. Thanks to the effective learning of causal features, our CICF enables models to have superior generalization ability. Extensive experiments on domain-generalization benchmark datasets demonstrate the effectiveness of our CICF, which achieves state-of-the-art performance.
Cross-domain recommendation can help alleviate the data-sparsity problem in traditional sequential recommender systems. In this paper, we propose the RecGURU algorithmic framework to generate a Generalized User Representation (GUR) that incorporates user information across domains in sequential recommendation, even when there are minimal or no common users across the two domains. We propose a self-attentive autoencoder to derive latent user representations, together with a domain discriminator that aims to predict the origin domain of a generated latent representation. We further propose a novel adversarial learning method to train the two modules, so that the user embeddings generated from different domains are unified into a single global GUR for each user. The learned GUR captures a user's overall preferences and characteristics, and can thus be used to augment the behavior data and improve recommendation in any single domain in which the user is involved. Extensive experiments are conducted on two public cross-domain recommendation datasets as well as a large dataset collected from real-world applications. The results show that RecGURU boosts performance, outperforming various state-of-the-art sequential recommendation and cross-domain recommendation methods. The collected data will be released to facilitate future research.
Video summarization methods are usually categorized as shot-level or frame-level methods, which are generally used separately. This paper investigates the potential complementarity between frame-level and shot-level methods and proposes a stacking-ensemble method for supervised video summarization. First, we build a stacking model to predict keyframe probabilities and temporal interest segments simultaneously. The two components are then combined via soft decision fusion to obtain a final score for each frame of the video. A joint loss function is proposed to train the model. Ablation results show that the proposed method outperforms either of the two corresponding individual methods. Moreover, extensive experiments and analyses on two benchmark datasets demonstrate the effectiveness of our method and its superior performance compared with state-of-the-art methods.
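The soft-decision-fusion step can be sketched as follows. The abstract does not specify the exact fusion rule, so a simple convex combination of each frame's keyframe probability with the score of the shot containing it is assumed here; the function and parameter names are illustrative.

```python
def fuse_scores(frame_probs, shot_bounds, shot_scores, alpha=0.5):
    """Soft decision fusion sketch (the paper's exact rule may differ).

    frame_probs: per-frame keyframe probabilities.
    shot_bounds: list of (start, end) frame indices per shot, end exclusive.
    shot_scores: one interest score per shot.
    Each frame's final score is a convex combination (weight alpha) of its
    frame-level probability and the score of its enclosing shot.
    """
    fused = list(frame_probs)
    for (start, end), s in zip(shot_bounds, shot_scores):
        for t in range(start, end):
            fused[t] = alpha * frame_probs[t] + (1 - alpha) * s
    return fused
```

With `alpha = 1.0` the fusion degenerates to the frame-level predictor, and with `alpha = 0.0` to the shot-level one, which is a convenient way to recover the two individual baselines from the ensemble.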
The problem of cross-modality person re-identification has been receiving increasing attention due to its practical significance. Motivated by the fact that humans usually attend to the differences when comparing two similar objects, we propose a dual-path cross-modality feature learning framework that preserves intrinsic spatial structure and attends to the differences between input cross-modality image pairs. Our framework consists of two main components: a Dual-path Spatial-structure-preserving Common Space Network (DSCSN) and a Contrastive Correlation Network (CCN). The former embeds cross-modality images into a common 3D tensor space without losing spatial structure, while the latter extracts contrastive features by dynamically comparing input image pairs. Note that the representations generated for the input RGB and infrared images are mutually dependent on each other. We conduct extensive experiments on two publicly available RGB-IR ReID datasets, SYSU-MM01 and RegDB, and our proposed method outperforms state-of-the-art algorithms by a large margin under both full and simplified evaluation modes.
Vision transformer has demonstrated great potential in abundant vision tasks. However, it also inevitably suffers from poor generalization capability when the distribution shift occurs in testing (i.e., out-of-distribution data). To mitigate this issue, we propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA). Particularly, we study the attention module in the vision transformer and notice that the alignment space using one global class token lacks enough flexibility, as it interacts with all image tokens in the same manner but ignores the rich semantics of different regions. In this paper, we aim to improve the richness of the alignment features by enabling semantic-aware adaptive message broadcasting. Specifically, we introduce a group of learned group tokens as nodes to aggregate the global information from all image tokens, but encourage different group tokens to adaptively focus their message broadcasting on different semantic regions. In this way, our message broadcasting encourages the group tokens to learn more informative and diverse information for effective domain alignment. Moreover, we systematically study the effects of adversarial-based feature alignment (ADA) and pseudo-label based self-training (PST) on UDA. We find that one simple two-stage training strategy with the cooperation of ADA and PST can further improve the adaptation capability of the vision transformer. Extensive experiments on DomainNet, OfficeHome, and VisDA-2017 demonstrate the effectiveness of our methods for UDA.
In RGB-D based 6D pose estimation, direct regression approaches can directly predict the 3D rotation and translation from RGB-D data, allowing for quick deployment and efficient inference. However, directly regressing the absolute translation of the pose suffers from diverse object translation distribution between the training and testing datasets, which is usually caused by the diversity of pose distribution of objects in 3D physical space. To this end, we generalize the pin-hole camera projection model to a residual-based projection model and propose the projective residual regression (Res6D) mechanism. Given a reference point for each object in an RGB-D image, Res6D not only reduces the distribution gap and shrinks the regression target to a small range by regressing the residual between the target and the reference point, but also aligns its output residual and its input to follow the projection equation between the 2D plane and 3D space. By plugging Res6D into the latest direct regression methods, we achieve state-of-the-art overall results on datasets including Occlusion LineMOD (ADD(S): 79.7%), LineMOD (ADD(S): 99.5%), and YCB-Video datasets (AUC of ADD(S): 95.4%).
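The pin-hole projection model and the residual-based regression target described above can be sketched in a few lines. This is a minimal illustration of the general idea: the intrinsics tuple, the reference-point values, and the helper names are assumptions for the example, not the paper's exact parameterization.

```python
def project(point3d, K):
    """Pin-hole projection of a 3D camera-space point (x, y, z), z > 0.

    K = (fx, fy, cx, cy): focal lengths and principal point in pixels.
    Returns the 2D pixel coordinates (u, v).
    """
    x, y, z = point3d
    fx, fy, cx, cy = K
    return (fx * x / z + cx, fy * y / z + cy)

def residual_target(t_obj, t_ref):
    """Res6D-style regression target (sketch): instead of the absolute
    translation t_obj, regress the small-range residual to a per-object
    reference point t_ref, shrinking the target distribution."""
    return tuple(a - b for a, b in zip(t_obj, t_ref))

def recover_translation(residual, t_ref):
    """Recover the absolute translation from a predicted residual."""
    return tuple(r + b for r, b in zip(residual, t_ref))
```

Because the network regresses `residual_target` rather than `t_obj` directly, the regression range stays small even when object depths vary widely between the training and testing sets, and the absolute pose is recovered by adding back the reference point.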